{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning with Shogun"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### By Saurabh Mahindre - <a href=\"https://github.com/Saurabh7\">github.com/Saurabh7</a> as a part of <a href=\"http://www.google-melange.com/gsoc/project/details/google/gsoc2014/saurabh7/5750085036015616\">Google Summer of Code 2014 project</a> mentored by - Heiko Strathmann - <a href=\"https://github.com/karlnapf\">github.com/karlnapf</a> - <a href=\"http://herrstrathmann.de/\">herrstrathmann.de</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we will see how machine learning problems are generally represented and solved in Shogun. As a primer to Shogun's many capabilities, we will see how various types of data and its attributes are handled and also how prediction is done. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. [Introduction](#Introduction)\n",
    "2. [Using datasets](#Using-datasets)\n",
    "3. [Feature representations](#Feature-representations)\n",
    "4. [Labels](#Assigning-labels)\n",
    "5. [Preprocessing data](#Preprocessing-data)\n",
    "6. [Supervised Learning with Shogun's CMachine interface](#supervised)\n",
    "7. [Evaluating performance and Model selection](#Evaluating-performance-and-Model-selection)\n",
    "8. [Example: Regression](#More-predictions:-Regression)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Machine learning concerns the construction and study of systems that can learn from data via exploiting certain types of structure within these. The uncovered patterns are then used to predict future data, or to perform other kinds of decision making. Two main classes (among others) of Machine Learning algorithms are: predictive or [supervised](http://en.wikipedia.org/wiki/Supervised_learning) learning and descriptive or [Unsupervised](http://en.wikipedia.org/wiki/Unsupervised_learning) learning. Shogun provides functionality to address those (and more) problem classes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%pylab inline\n",
    "%matplotlib inline\n",
    "import os\nSHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')\n",
    "#To import all Shogun classes\n",
    "from shogun import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a general problem setting for the supervised learning approach, the goal is to learn a mapping from inputs $x_i\\in\\mathcal{X} $ to outputs $y_i \\in \\mathcal{Y}$, given a labeled set of input-output pairs $ \\mathcal{D} = {(x_i,y_i)}^{\\text N}_{i=1} $$\\subseteq \\mathcal{X} \\times \\mathcal{Y}$. Here $ \\mathcal{D}$ is called the training set, and $\\text N$ is the number of training examples. In the simplest setting, each training input $x_i$ is a  $\\mathcal{D}$ -dimensional vector of numbers, representing, say, the height and weight of a person. These are called $\\textbf {features}$, attributes or covariates. In general, however, $x_i$ could be a complex structured object, such as an image.<ul><li>When the response variable $y_i$ is categorical and discrete, $y_i \\in$ {1,...,C} (say male or female) it is a [classification](http://en.wikipedia.org/wiki/Classification_in_machine_learning) problem.</li><li>When it is continuous (say the prices of houses) it is a [regression](http://en.wikipedia.org/wiki/Regression_analysis) problem.</li></ul>\n",
    "For the unsupervised learning\n",
    "approach we are only given inputs,  $\\mathcal{D} = {(x_i)}^{\\text N}_{i=1}$ , and the goal is to find “interesting\n",
    "patterns” in the data. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us consider an example, we have a dataset about various attributes of individuals and we know whether or not they are diabetic. The data reveals certain configurations of attributes that correspond to diabetic patients and others that correspond to non-diabetic patients. When given a set of attributes for a new patient, the goal is to predict whether the patient is diabetic or not. This type of learning problem falls under [Supervised learning](http://en.wikipedia.org/wiki/Supervised_learning), in particular, [classification](http://en.wikipedia.org/wiki/Classification_in_machine_learning)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Shogun provides the capability to load datasets of different formats using [CFile](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CFile.html).</br>  A real world dataset: [Pima Indians Diabetes data set](http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) is used now. We load the `LibSVM` format file using Shogun's [LibSVMFile](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLibSVMFile.html) class. The `LibSVM` format is: $$\\space \\text {label}\\space \\text{attribute1:value1 attribute2:value2 }...$$$$\\space.$$$$\\space .$$ LibSVM uses the so called \"sparse\" format where zero values do not need to be stored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Load the file\n",
    "data_file=LibSVMFile(os.path.join(SHOGUN_DATA_DIR, 'uci/diabetes/diabetes_scale.svm'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This results in a [LibSVMFile](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLibSVMFile.html) object which we will later use to access the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature representations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get off the mark, let us see how Shogun handles the attributes of the data using [CFeatures](http://shogun-toolbox.org/api/latest/classshogun_1_1CFeatures.html) class. Shogun supports wide range of feature representations. We believe it is a good idea to have different forms of data, rather than converting them all into matrices. Among these are: $\\hspace {20mm}$<ul><li>[String features](http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CStringFeatures.html): Implements a list of strings. Not limited to character strings, but could also be sequences of floating point numbers etc. Have varying dimensions. </li> <li>[Dense features](http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CDenseFeatures.html): Implements dense feature matrices</li> <li>[Sparse features](http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CSparseFeatures.html):  Implements sparse matrices.</li><li>[Streaming features](http://shogun-toolbox.org/doc/en/latest/classshogun_1_1CStreamingFeatures.html): For algorithms working on data streams (which are too large to fit into memory) </li></ul> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`SpareRealFeatures` (sparse features handling `64 bit float` type data) are used to get the data from the file. Since `LibSVM` format files have labels included in the file, `load_with_labels` method of `SpareRealFeatures` is used. In this case it is interesting to play with two attributes, Plasma glucose concentration and Body Mass Index (BMI) and try to learn something about their relationship with the disease. We get hold of the feature matrix using `get_full_feature_matrix` and row vectors 1 and 5 are extracted. These are the attributes we are interested in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "f=SparseRealFeatures()\n",
    "trainlab=f.load_with_labels(data_file)\n",
    "mat=f.get_full_feature_matrix()\n",
    "\n",
    "#exatract 2 attributes\n",
    "glucose_conc=mat[1]\n",
    "BMI=mat[5]\n",
    "\n",
    "#generate a numpy array\n",
    "feats=array(glucose_conc)\n",
    "feats=vstack((feats, array(BMI)))\n",
    "print(feats, feats.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In numpy, this is a matrix of 2 row-vectors of dimension 768. However, in Shogun, this will be a matrix of 768 column vectors of dimension 2. This is beacuse each data sample is stored in a column-major fashion, meaning each column here corresponds to an individual sample and each row in it to an atribute like BMI, Glucose concentration etc. To convert the extracted matrix into Shogun format, `RealFeatures` are used which are nothing but the above mentioned [Dense features](http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CDenseFeatures.html) of `64bit Float` type. To do this call `RealFeatures` with the  matrix (this should be a 64bit 2D numpy array) as the argument. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#convert to shogun format\n",
    "feats_train=RealFeatures(feats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the general methods you might find useful are:\n",
    "\n",
    "* `get_feature_matrix()`: The feature matrix can be accessed using this.\n",
    "* `get_num_features()`: The total number of attributes can be accesed using this.\n",
    "* `get_num_vectors()`: To get total number of samples in data.\n",
    "* `get_feature_vector()`: To get all the attribute values (A.K.A [feature vector](http://en.wikipedia.org/wiki/Feature_vector)) for a particular sample by passing the index of the sample as argument.</li></ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Get number of features(attributes of data) and num of vectors(samples)\n",
    "feat_matrix=feats_train.get_feature_matrix()\n",
    "num_f=feats_train.get_num_features()\n",
    "num_s=feats_train.get_num_vectors()\n",
    "\n",
    "print('Number of attributes: %s and number of samples: %s' %(num_f, num_s))\n",
    "print('Number of rows of feature matrix: %s and number of columns: %s' %(feat_matrix.shape[0], feat_matrix.shape[1]))\n",
    "print('First column of feature matrix (Data for first individual):')\n",
    "print(feats_train.get_feature_vector(0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assigning labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In supervised learning problems, training data is labelled. Shogun provides various types of labels to do this through [Clabels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLabels.html). Some of these are:<ul><li>[Binary labels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CBinaryLabels.html): Binary Labels for binary classification which can have values +1 or -1.</li><li>[Multiclass labels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CMulticlassLabels.html): Multiclass Labels for multi-class classification which can have values from 0 to (num. of classes-1).</li><li>[Regression labels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CRegressionLabels.html):  Real-valued labels used for regression problems and are returned as output of classifiers.</li><li>[Structured labels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CStructuredLabels.html): Class of the labels used in Structured Output (SO) problems</li></ul></br> In this particular problem, our data can be of two types: diabetic or non-diabetic, so we need [binary labels](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CBinaryLabels.html). This makes it a [Binary Classification problem](http://en.wikipedia.org/wiki/Binary_classification), where the data has to be classified in two groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#convert to shogun format labels\n",
    "labels=BinaryLabels(trainlab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The labels can be accessed using `get_labels` and the confidence vector using `get_values`. The total number of labels is available using `get_num_labels`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "n=labels.get_num_labels()\n",
    "print('Number of labels:', n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is usually better to preprocess data to a standard form rather than handling it in raw form. The reasons are having a well behaved-scaling, many algorithms assume centered data, and that sometimes one wants to de-noise data (with say PCA). Preprocessors do not change the domain of the input features. It is possible to do various type of preprocessing using methods provided by [CPreprocessor](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CPreprocessor.html) class. Some of these are:<ul><li>[Norm one](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CNormOne.html): Normalize vector to have norm 1.</li><li>[PruneVarSubMean](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CPruneVarSubMean.html):  Substract the mean and remove features that have zero variance. </li><li>[Dimension Reduction](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDimensionReductionPreprocessor.html): Lower the dimensionality of given simple features.<ul><li>[PCA](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CPCA.html): Principal component analysis.</li><li>[Kernel PCA](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernelPCA.html): PCA using kernel methods.</li></ul></li></ul> The training data will now be preprocessed using [CPruneVarSubMean](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CPruneVarSubMean.html). This will basically remove data with zero variance and subtract the mean. Passing a `True` to the constructor makes the class normalise the varaince of the variables. It basically dividies every dimension through its standard-deviation. This is the reason behind removing dimensions with constant values.  It is required to initialize the preprocessor by passing the feature object to `init` before doing anything else. The raw and processed data is now plotted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "preproc=PruneVarSubMean(True)\n",
    "preproc.init(feats_train)\n",
    "feats_train.add_preprocessor(preproc)\n",
    "feats_train.apply_preprocessor()\n",
    "\n",
    "# Store preprocessed feature matrix.\n",
    "preproc_data=feats_train.get_feature_matrix()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Plot the raw training data.\n",
    "figure(figsize=(13,6))\n",
    "pl1=subplot(121)\n",
    "gray()\n",
    "_=scatter(feats[0, :], feats[1,:], c=labels, s=50)\n",
    "vlines(0, -1, 1, linestyle='solid', linewidths=2)\n",
    "hlines(0, -1, 1, linestyle='solid', linewidths=2)\n",
    "title(\"Raw Training Data\")\n",
    "_=xlabel('Plasma glucose concentration')\n",
    "_=ylabel('Body mass index')\n",
    "p1 = Rectangle((0, 0), 1, 1, fc=\"w\")\n",
    "p2 = Rectangle((0, 0), 1, 1, fc=\"k\")\n",
    "pl1.legend((p1, p2), [\"Non-diabetic\", \"Diabetic\"], loc=2)\n",
    "\n",
    "#Plot preprocessed data.\n",
    "pl2=subplot(122)\n",
    "_=scatter(preproc_data[0, :], preproc_data[1,:], c=labels, s=50)\n",
    "vlines(0, -5, 5, linestyle='solid', linewidths=2)\n",
    "hlines(0, -5, 5, linestyle='solid', linewidths=2)\n",
    "title(\"Training data after preprocessing\")\n",
    "_=xlabel('Plasma glucose concentration')\n",
    "_=ylabel('Body mass index')\n",
    "p1 = Rectangle((0, 0), 1, 1, fc=\"w\")\n",
    "p2 = Rectangle((0, 0), 1, 1, fc=\"k\")\n",
    "pl2.legend((p1, p2), [\"Non-diabetic\", \"Diabetic\"], loc=2)\n",
    "gray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Horizontal and vertical lines passing through zero are included to make the processing of data clear. Note that the now processed data has zero mean."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <a id='supervised'>Supervised Learning with Shogun's <a href='http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CMachine.html'>CMachine</a> interface</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[CMachine](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CMachine.html) is Shogun's interface for general learning machines. Basically one has to ` train()`  the machine on some training data to be able to learn from it. Then we `apply()` it to test data to get predictions. Some of these are: <ul><li>[Kernel machine](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernelMachine.html): Kernel based learning tools.</li><li>[Linear machine](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLinearMachine.html): Interface for all kinds of linear machines like classifiers.</li><li>[Distance machine](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CDistanceMachine.html): A distance machine is based on a a-priori choosen distance.</li><li>[Gaussian process machine](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGaussianProcessMachine.html): A base class for Gaussian Processes. </li><li>And many more</li></ul>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moving on to the prediction part, [Liblinear](http://www.shogun-toolbox.org/doc/en/current/classshogun_1_1CLibLinear.html), a linear SVM is used to do the classification (more on SVMs in [this notebook](http://www.shogun-toolbox.org/static/notebook/current/SupportVectorMachines.html)). A linear SVM will find a linear separation with the largest possible margin. Here C is a penalty parameter on the loss function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#prameters to svm\n",
    "C=0.9\n",
    "\n",
    "svm=LibLinear(C, feats_train, labels)\n",
    "svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)\n",
    "\n",
    "#train\n",
    "svm.train()\n",
    "\n",
    "size=100\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now `apply` on test features to get predictions. For visualising the classification boundary, the whole XY is used as test data, i.e. we predict the class on every point in the grid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "x1=linspace(-5.0, 5.0, size)\n",
    "x2=linspace(-5.0, 5.0, size)\n",
    "x, y=meshgrid(x1, x2)\n",
    "#Generate X-Y grid test data\n",
    "grid=RealFeatures(array((ravel(x), ravel(y))))\n",
    "\n",
    "#apply on test grid\n",
    "predictions = svm.apply(grid)\n",
    "#get output labels\n",
    "z=predictions.get_values().reshape((size, size))\n",
    "\n",
    "#plot\n",
    "jet()\n",
    "figure(figsize=(9,6))\n",
    "title(\"Classification\")\n",
    "c=pcolor(x, y, z)\n",
    "_=contour(x, y, z, linewidths=1, colors='black', hold=True)\n",
    "_=colorbar(c)\n",
    "\n",
    "_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)\n",
    "_=xlabel('Plasma glucose concentration')\n",
    "_=ylabel('Body mass index')\n",
    "p1 = Rectangle((0, 0), 1, 1, fc=\"w\")\n",
    "p2 = Rectangle((0, 0), 1, 1, fc=\"k\")\n",
    "legend((p1, p2), [\"Non-diabetic\", \"Diabetic\"], loc=2)\n",
    "gray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us have a look at the weight vector of the separating hyperplane. It should tell us about the linear relationship between the features. The decision boundary is now plotted by solving for $\\bf{w}\\cdot\\bf{x}$ + $\\text{b}=0$. Here $\\text b$ is a bias term which allows the linear function to be offset from the origin of the used coordinate system. Methods `get_w()` and `get_bias()` are used to get the necessary values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "w=svm.get_w()\n",
    "b=svm.get_bias()\n",
    "\n",
    "x1=linspace(-2.0, 3.0, 100)\n",
    "\n",
    "#solve for w.x+b=0\n",
    "def solve (x1):\n",
    "    return -( ( (w[0])*x1 + b )/w[1] )\n",
    "x2=list(map(solve, x1))\n",
    "\n",
    "#plot\n",
    "figure(figsize=(7,6))\n",
    "plot(x1,x2, linewidth=2)\n",
    "title(\"Decision boundary using w and bias\")\n",
    "_=scatter(preproc_data[0, :], preproc_data[1,:], c=trainlab, cmap=gray(), s=50)\n",
    "_=xlabel('Plasma glucose concentration')\n",
    "_=ylabel('Body mass index')\n",
    "p1 = Rectangle((0, 0), 1, 1, fc=\"w\")\n",
    "p2 = Rectangle((0, 0), 1, 1, fc=\"k\")\n",
    "legend((p1, p2), [\"Non-diabetic\", \"Diabetic\"], loc=2)\n",
    "\n",
    "print('w :', w)\n",
    "print('b :', b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this problem, a linear classifier does a reasonable job in distinguishing labelled data. An interpretation could be that individuals below a certain level of BMI and glucose are likely to have no Diabetes. \n",
    "For problems where the data cannot be separated linearly, there are more advanced classification methods, as for example all of Shogun's [kernel machines](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernelMachine.html), but more on this later. To play with this interactively have a look at this: [web demo](http://demos.shogun-toolbox.org/classifier/binary/) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluating performance and Model selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How do you assess the quality of a prediction? Shogun provides various ways to do this using [CEvaluation](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CEvaluation.html). The preformance is evaluated by comparing the predicted output and the expected output. Some of the base classes for performance measures are:\n",
    "\n",
    "* [Binary class evaluation](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CBinaryClassEvaluation.html): used to evaluate binary classification labels. \n",
    "* [Clustering evaluation](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CClusteringEvaluation.html): used to evaluate clustering.\n",
    "* [Mean absolute error](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CMeanAbsoluteError.html): used to compute an error of regression model.\n",
    "* [Multiclass accuracy](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CMulticlassAccuracy.html): used to compute accuracy of multiclass classification. \n",
    "\n",
    "Evaluating on training data should be avoided since the learner may adjust to very specific random features of the training data which are not very important to the general relation. This is called [overfitting](http://en.wikipedia.org/wiki/Overfitting). Maximising performance on the training examples usually results in algorithms explaining the noise in data (rather than actual patterns), which leads to bad performance on unseen data. The dataset will now be split into two, we train on one part and evaluate performance on other using [CAccuracyMeasure](http://shogun-toolbox.org/api/latest/classshogun_1_1CAccuracyMeasure.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#split features for training and evaluation\n",
    "num_train=700\n",
    "feats=array(glucose_conc)\n",
    "feats_t=feats[:num_train]\n",
    "feats_e=feats[num_train:]\n",
    "feats=array(BMI)\n",
    "feats_t1=feats[:num_train]\n",
    "feats_e1=feats[num_train:]\n",
    "feats_t=vstack((feats_t, feats_t1))\n",
    "feats_e=vstack((feats_e, feats_e1))\n",
    "\n",
    "feats_train=RealFeatures(feats_t)\n",
    "feats_evaluate=RealFeatures(feats_e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see the accuracy by applying on test features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "label_t=trainlab[:num_train]\n",
    "labels=BinaryLabels(label_t)\n",
    "label_e=trainlab[num_train:]\n",
    "labels_true=BinaryLabels(label_e)\n",
    "\n",
    "svm=LibLinear(C, feats_train, labels)\n",
    "svm.set_liblinear_solver_type(L2R_L2LOSS_SVC)\n",
    "\n",
    "#train and evaluate\n",
    "svm.train()\n",
    "output=svm.apply(feats_evaluate)\n",
    "\n",
    "#use AccuracyMeasure to get accuracy\n",
    "acc=AccuracyMeasure()\n",
    "acc.evaluate(output,labels_true)\n",
    "accuracy=acc.get_accuracy()*100\n",
    "print('Accuracy(%):', accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To evaluate more efficiently [cross-validation](http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29) is used. As you might have wondered how are the parameters of the classifier selected? Shogun has a model selection framework to select the best parameters. More description of these things in [this notebook](http://shogun-toolbox.org/notebook/latest/xval_modelselection.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### More predictions: Regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section will demonstrate another type of machine learning problem on real world data.</br> The task is to estimate prices of houses in Boston using the [Boston Housing Dataset](https://archive.ics.uci.edu/ml/datasets/Housing) provided by [StatLib library](http://lib.stat.cmu.edu/DASL/). The attributes are: Weighted distances to employment centres and percentage lower status of the population. Let us see if we can predict a good relationship between the pricing of houses and the attributes. This type of problems are solved using [Regression analysis](http://en.wikipedia.org/wiki/Regression_analysis)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data set is now loaded using [LibSVMFile](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CLibSVMFile.html) as in the previous sections and the attributes required (7th and 12th vector ) are converted to Shogun format features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "temp_feats=RealFeatures(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/fm_housing.dat')))\n",
    "labels=RegressionLabels(CSVFile(os.path.join(SHOGUN_DATA_DIR, 'uci/housing/housing_label.dat')))\n",
    "\n",
    "#rescale to 0...1\n",
    "preproc=RescaleFeatures()\n",
    "preproc.init(temp_feats)\n",
    "temp_feats.add_preprocessor(preproc)\n",
    "temp_feats.apply_preprocessor(True)\n",
    "mat = temp_feats.get_feature_matrix()\n",
    "\n",
    "dist_centres=mat[7]\n",
    "lower_pop=mat[12]\n",
    "\n",
    "feats=array(dist_centres)\n",
    "feats=vstack((feats, array(lower_pop)))\n",
    "print(feats, feats.shape)\n",
    "#convert to shogun format features\n",
    "feats_train=RealFeatures(feats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tool we will use here to perform regression is [Kernel ridge regression](http://shogun-toolbox.org/doc/en/latest/classshogun_1_1CKernelRidgeRegression.html). Kernel Ridge Regression is a non-parametric version of ridge regression where the [kernel trick](http://en.wikipedia.org/wiki/Kernel_trick) is used to solve a related linear ridge regression problem in a higher-dimensional space, whose results correspond to non-linear regression in the data-space.  Again we train on the data and apply on the XY grid to get predicitions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "size=100\n",
    "x1=linspace(0, 1.0, size)\n",
    "x2=linspace(0, 1.0, size)\n",
    "x, y=meshgrid(x1, x2)\n",
    "#Generate X-Y grid test data\n",
    "grid=RealFeatures(array((ravel(x), ravel(y))))\n",
    "\n",
    "#Train on data(both attributes) and predict\n",
    "width=1.0\n",
    "tau=0.5\n",
    "kernel=GaussianKernel(feats_train, feats_train, width)\n",
    "krr=KernelRidgeRegression(tau, kernel, labels)\n",
    "krr.train(feats_train)\n",
    "kernel.init(feats_train, grid)\n",
    "out = krr.apply().get_labels()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `out` variable now contains a relationship between the attributes. Below is an attempt to establish such relationship between the attributes individually. Separate feature instances are created for each attribute. You could skip the code and have a look at the plots directly if you just want the essence.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#create feature objects for individual attributes.\n",
    "feats_test=RealFeatures(x1.reshape(1,len(x1)))\n",
    "feats_t0=array(dist_centres)\n",
    "feats_train0=RealFeatures(feats_t0.reshape(1,len(feats_t0)))\n",
    "feats_t1=array(lower_pop)\n",
    "feats_train1=RealFeatures(feats_t1.reshape(1,len(feats_t1)))\n",
    "\n",
    "#Regression with first attribute\n",
    "kernel=GaussianKernel(feats_train0, feats_train0, width)\n",
    "krr=KernelRidgeRegression(tau, kernel, labels)\n",
    "krr.train(feats_train0)\n",
    "kernel.init(feats_train0, feats_test)\n",
    "out0 = krr.apply().get_labels()\n",
    "\n",
    "#Regression with second attribute \n",
    "kernel=GaussianKernel(feats_train1, feats_train1, width)\n",
    "krr=KernelRidgeRegression(tau, kernel, labels)\n",
    "krr.train(feats_train1)\n",
    "kernel.init(feats_train1, feats_test)\n",
    "out1 = krr.apply().get_labels()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Visualization of regression\n",
    "fig=figure(figsize(20,6))\n",
    "#first plot with only one attribute\n",
    "fig.add_subplot(131)\n",
    "title(\"Regression with 1st attribute\")\n",
    "_=scatter(feats[0, :], labels.get_labels(), cmap=gray(), s=20)\n",
    "_=xlabel('Weighted distances to employment centres ')\n",
    "_=ylabel('Median value of homes')\n",
    "\n",
    "_=plot(x1,out0, linewidth=3)\n",
    "\n",
    "#second plot with only one attribute\n",
    "fig.add_subplot(132)\n",
    "title(\"Regression with 2nd attribute\")\n",
    "_=scatter(feats[1, :], labels.get_labels(), cmap=gray(), s=20)\n",
    "_=xlabel('% lower status of the population')\n",
    "_=ylabel('Median value of homes')\n",
    "_=plot(x1,out1, linewidth=3)\n",
    "\n",
    "#Both attributes and regression output\n",
    "ax=fig.add_subplot(133, projection='3d')\n",
    "z=out.reshape((size, size))\n",
    "gray()\n",
    "title(\"Regression\")\n",
    "ax.plot_wireframe(y, x, z, linewidths=2, alpha=0.4)\n",
    "ax.set_xlabel('% lower status of the population')\n",
    "ax.set_ylabel('Distances to employment centres ')\n",
    "ax.set_zlabel('Median value of homes')\n",
    "ax.view_init(25, 40)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
